Running R in parallel on the TACC

James E. Pustejovsky

UPDATE (4/8/2014): I have learned from Mr. Yaakoub El Khamra that he and the good folks at TACC have made some modifications to TACC’s custom MPI implementation and R build in order to correct bugs in Rmpi and snow that were causing crashes. This post has been updated to reflect the modifications.

I’ve started to use the Texas Advanced Computing Cluster to run statistical simulations in R. It takes a little bit of time to get up and running, but once you do it is an amazing tool. To get started, you’ll need

An account on the TACC and an allocation of computing time.
An ssh client like PUTTY.
Some R code that can be adapted to run in parallel.
A SLURM script that tells the server (called Stampede) how to run the R.

The R script

I’ve been running my simulations using a combination of several packages that provide very high-level functionality for parallel computing, namely foreach, doSNOW, and the maply function in plyr. All of this runs on top of an Rmpi implementation developed by the folks at TACC (more details here).

In an earlier post, I shared code for running a very simple simulation of the Behrens-Fisher problem. Here’s adapted code for running the same simulation on Stampede. The main difference is that there are a few extra lines of code to set up a cluster, seed a random number generator, and pass necessary objects (saved in source_func) to the nodes of the cluster:

Code

library(Rmpi)
library(snow)
library(foreach)
library(iterators)
library(doSNOW)
library(plyr)

# set up parallel processing
cluster <- getMPIcluster()
registerDoSNOW(cluster)

# export source functions
clusterExport(cluster, source_func)

Once it is all set up, running the code is just a matter of turning on the parallel option in mdply:

Code

BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE)

I fully admit that my method of passing source functions is rather kludgy. One alternative would be to save all of the source functions in a separate file (say, source_functions.R), then source the file at the beginning of the simulation script:

Code

rm(list=ls())
source("source_functions.R")
print(source_func <- ls())

Another, more elegant alternative would be to put all of your source functions in a little package (say, BehrensFisher), install the package, and then pass the package in the maply call:

Code

BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE, .paropts = list(.packages="BehrensFisher"))

Of course, developing a package involves a bit more work on the front end.

The SLURM script

Suppose that you’ve got your R code saved in a file called Behrens_Fisher.R. Here’s an example of a SLURM script that runs the R script after configuring an Rmpi cluster:

Code

#!/bin/bash
#SBATCH -J Behrens          # Job name
#SBATCH -o Behrens.o%j      # Name of stdout output file (%j expands to jobId)
#SBATCH -e Behrens.o%j      # Name of stderr output file(%j expands to jobId)
#SBATCH -n 32               # Total number of mpi tasks requested
#SBATCH -p normal           # Submit to the 'normal' or 'development' queue
#SBATCH -t 0:20:00          # Run time (hh:mm:ss)
#SBATCH -A A-yourproject    # Allocation name to charge job against
#SBATCH --mail-user=you@email.address # specify email address for notifications
#SBATCH --mail-type=begin   # email when job begins
#SBATCH --mail-type=end     # email when job ends

# load R module
module load Rstats           

# call R code from RMPISNOW
ibrun RMPISNOW < Behrens_Fisher.R

The file should be saved in a plain text file called something like run_BF.slurm. The file has to use ANSI encoding and Unix-type end-of-line encoding; Notepad++ is a text editor that can create files in this format.

Note that for full efficiency, the -n option should be a multiple of 16 because their are 16 cores per compute node. Further details about SBATCH options can be found here.

Running on Stampede

Follow these directions to log in to the Stampede server. Here’s the User Guide for Stampede. The first thing you’ll need to do is ensure that you’ve got the proper version of MVAPICH loaded. To do that, type

Code

module swap intel intel/14.0.1.106
module setdefault

The second line sets this as the default, so you won’t need to do this step again.

Second, you’ll need to install whatever R packages you’ll need to run your code. To do that, type the following at the login4$ prompt:

Code

login4$module load Rstats
login4$R

This will start an interactive R session. From the R prompt, use install.packages to download and install, e.g.

Code

install.packages("plyr","reshape","doSNOW","foreach","iterators")

The packages will be installed in a local library. Now type q() to quit R.

Next, make a new directory for your project:

Code

login4$mkdir project_name
login4$cd project_name

Upload your files to the directory (using psftp, for instance). Check that your R script is properly configured by viewing it in Vim.

Finally, submit your job by typing

Code

login4$sbatch run_BF.slurm

or whatever your SLURM script is called. To check the status of the submitted job, type showq -u followed by your TACC user name (more details here).

Further thoughts

TACC accounts come with a limited number of computing hours, so you should be careful to write efficient code. Before you even start worrying about running on TACC, you should profile your code and try to find ways to speed up the computations. (Some simple improvements in my Behrens-Fisher code would make it run MUCH faster.) Once you’ve done what you can in terms of efficiency, you should do some small test runs on Stampede. For example, you could try running only a few iterations for each combination of factors, and/or running only some of the combinations rather than the full factorial design. Based on the run-time for these jobs, you’ll then be able to estimate how long the full code would take. If it’s acceptable (and within your allocation), then go ahead and sbatch the full job. If it’s not, you might reconsider the number of factor levels in your design or the number of iterations you need. I might have more comments about those some other time.

Comments? Suggestions? Corrections? Drop a comment.

--- title: Running R in parallel on the TACC date: '2013-12-20' categories: - Rstats - programming - simulation - TACC code-tools: true --- UPDATE (4/8/2014): I have learned from [Mr. Yaakoub El Khamra](https://www.tacc.utexas.edu/staff/yaakoub-el-khamra) that he and the good folks at TACC have made some modifications to TACC's custom MPI implementation and R build in order to correct bugs in Rmpi and snow that were causing crashes. This post [has been updated](/posts/parallel-R-on-TACC-update) to reflect the modifications. I've started to use the Texas Advanced Computing Cluster to run statistical simulations in R. It takes a little bit of time to get up and running, but once you do it is an amazing tool. To get started, you'll need 1. An account on the [TACC](https://www.tacc.utexas.edu/) and an allocation of computing time. 2. An ssh client like [PUTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/). 3. Some R code that can be adapted to run in parallel. 4. A SLURM script that tells the server (called Stampede) how to run the R. ### The R script I've been running my simulations using a combination of several packages that provide very high-level functionality for parallel computing, namely `foreach`, `doSNOW`, and the `maply` function in `plyr`. All of this runs on top of an `Rmpi` implementation developed by the folks at TACC ([more details here](https://portal.tacc.utexas.edu/documents/13601/901835/Parallel_R_Final.pdf/)). In [an earlier post](/posts/Designing-simulation-studies-using-R/), I shared code for running a very simple simulation of the Behrens-Fisher problem. Here's [adapted code](https://gist.github.com/jepusto/8059893) for running the same simulation on Stampede. The main difference is that there are a few extra lines of code to set up a cluster, seed a random number generator, and pass necessary objects (saved in `source_func`) to the nodes of the cluster: ```{r, eval=FALSE} library(Rmpi) library(snow) library(foreach) library(iterators) library(doSNOW) library(plyr) # set up parallel processing cluster <- getMPIcluster() registerDoSNOW(cluster) # export source functions clusterExport(cluster, source_func) ``` Once it is all set up, running the code is just a matter of turning on the parallel option in `mdply`: ```{r, eval=FALSE} BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE) ``` I fully admit that my method of passing source functions is rather kludgy. One alternative would be to save all of the source functions in a separate file (say, `source_functions.R`), then `source` the file at the beginning of the simulation script: ```{r, eval=FALSE} rm(list=ls()) source("source_functions.R") print(source_func <- ls()) ``` Another, more elegant alternative would be to put all of your source functions in a little package (say, `BehrensFisher`), install the package, and then pass the package in the `maply` call: ```{r, eval=FALSE} BFresults <- mdply(parms, .fun = run_sim, .drop=FALSE, .parallel=TRUE, .paropts = list(.packages="BehrensFisher")) ``` Of course, developing a package involves a bit more work on the front end. ### The SLURM script Suppose that you've got your R code saved in a file called `Behrens_Fisher.R`. Here's an example of a SLURM script that runs the R script after configuring an Rmpi cluster: ```{r, engine="bash", eval = FALSE} #!/bin/bash #SBATCH -J Behrens # Job name #SBATCH -o Behrens.o%j # Name of stdout output file (%j expands to jobId) #SBATCH -e Behrens.o%j # Name of stderr output file(%j expands to jobId) #SBATCH -n 32 # Total number of mpi tasks requested #SBATCH -p normal # Submit to the 'normal' or 'development' queue #SBATCH -t 0:20:00 # Run time (hh:mm:ss) #SBATCH -A A-yourproject # Allocation name to charge job against #SBATCH --mail-user=you@email.address # specify email address for notifications #SBATCH --mail-type=begin # email when job begins #SBATCH --mail-type=end # email when job ends # load R module module load Rstats # call R code from RMPISNOW ibrun RMPISNOW < Behrens_Fisher.R ``` The file should be saved in a plain text file called something like `run_BF.slurm`. The file has to use ANSI encoding and Unix-type end-of-line encoding; [Notepad++](http://notepad-plus-plus.org/) is a text editor that can create files in this format. Note that for full efficiency, the `-n` option should be a multiple of 16 because their are 16 cores per compute node. Further details about SBATCH options can be found [here](https://portal.tacc.utexas.edu/user-guides/stampede#running-slurm-jobcontrol). ### Running on Stampede [Follow these directions](https://portal.tacc.utexas.edu/user-guides/stampede#access) to log in to the Stampede server. Here's the [User Guide](https://portal.tacc.utexas.edu/user-guides/stampede) for Stampede. The first thing you'll need to do is ensure that you've got the proper version of MVAPICH loaded. To do that, type ```{r, engine="bash", eval = FALSE} module swap intel intel/14.0.1.106 module setdefault ``` The second line sets this as the default, so you won't need to do this step again. Second, you'll need to install whatever R packages you'll need to run your code. To do that, type the following at the `login4$` prompt: ```{r, engine="bash", eval = FALSE} login4$module load Rstats login4$R ``` This will start an interactive R session. From the R prompt, use `install.packages` to download and install, e.g. ```{r, eval=FALSE} install.packages("plyr","reshape","doSNOW","foreach","iterators") ``` The packages will be installed in a local library. Now type `q()` to quit R. Next, make a new directory for your project: ```{r, engine="bash", eval = FALSE} login4$mkdir project_name login4$cd project_name ``` Upload your files to the directory (using [psftp](http://the.earth.li/~sgtatham/putty/0.63/htmldoc/Chapter6.html), for instance). Check that your R script is properly configured by viewing it in Vim. Finally, submit your job by typing ```{r, engine="bash", eval = FALSE} login4$sbatch run_BF.slurm ``` or whatever your SLURM script is called. To check the status of the submitted job, type `showq -u` followed by your TACC user name (more details [here](https://portal.tacc.utexas.edu/user-guides/stampede#running-slurm-jobcontrol-squeue)). ### Further thoughts TACC accounts come with a limited number of computing hours, so you should be careful to write efficient code. Before you even start worrying about running on TACC, you should profile your code and try to find ways to speed up the computations. (Some simple improvements in my Behrens-Fisher code would make it run MUCH faster.) Once you've done what you can in terms of efficiency, you should do some small test runs on Stampede. For example, you could try running only a few iterations for each combination of factors, and/or running only some of the combinations rather than the full factorial design. Based on the run-time for these jobs, you'll then be able to estimate how long the full code would take. If it's acceptable (and within your allocation), then go ahead and `sbatch` the full job. If it's not, you might reconsider the number of factor levels in your design or the number of iterations you need. I might have more comments about those some other time. Comments? Suggestions? Corrections? Drop a comment.